In [1]:
import pandas as pd

Identify your problem statement, find all your datasets, identify the questions you want to answer, reach out to polling/consulting firms to work with.

Potential question--Why did these counties flip to Trump?

Explore your data to understand it--drop data that is not relevant

Look to predict something (next presidential election outcome).

Think about what would happen if more people became UNINSURED and the result that could have.

Should slcie by margin of county flip. First-fourth quartiles

Look at population counts per county.

Margin of victory/voting which way (Trump/Clinton) is more important to predict than simply whcih flipped (make that a subset)

A listing of the specific counties that flipped: http://www.npr.org/2016/11/15/502032052/lots-of-people-voted-for-obama-and-trump-heres-where-in-3-charts

Nate Silver postulates that education level is a key predctor. http://fivethirtyeight.com/features/education-not-income-predicted-who-would-vote-for-trump/?ex_cid=story-twitter

Daily Kos article: http://www.dailykos.com/story/2017/1/30/1627319/-Daily-Kos-Elections-presents-the-2016-presidential-election-results-by-congressional-district

Diversity Index scource: https://www.kaggle.com/mikejohnsonjr/us-counties-diversity-index


In [2]:
election = pd.read_csv('2016_election.csv')

In [3]:
prev_election = pd.read_csv('2012_election.csv')

In [4]:
div = pd.read_csv('diversityindex.csv')

In [5]:
edu = pd.read_excel('education_25_older_filt.xls')

Change in education the past 10 years--find the difference between them for each county


In [6]:
pop = pd.read_excel('us county populations.xls')

In [56]:
ue_rates = pd.read_excel('Unemployment Rates.xlsx')
ue_rates = ue_rates.drop(ue_rates[[0,1,2,4,5]],axis=1)
ue_rates = ue_rates.rename(columns={'Unnamed: 3':'county_state','Unnamed: 6':'labor_force', 'Unnamed: 7':'employed','Unnamed: 8':'unemployed','Unnamed: 9':'ue_rate'})
ue_rates = ue_rates.drop(ue_rates.index[[0,1,2,3,4]])

In [7]:
len(edu)


Out[7]:
3283

In [8]:
len(pop)


Out[8]:
3145

In [9]:
pop.dtypes


Out[9]:
state              object
county             object
est_pop_2015        int64
pop_change_2015     int64
int_mig_2015        int64
dom_mig_2015        int64
mig_2015            int64
dtype: object

In [10]:
div.head()


Out[10]:
Location Diversity-Index Black or African American alone, percent, 2013 American Indian and Alaska Native alone, percent, 2013 Asian alone, percent, 2013 Native Hawaiian and Other Pacific Islander alone, percent, Two or More Races, percent, 2013 Hispanic or Latino, percent, 2013 White alone, not Hispanic or Latino, percent, 2013
0 Aleutians West Census Area, AK 0.769346 7.4 13.8 31.1 2.3 4.8 14.6 29.2
1 Queens County, NY 0.742224 20.9 1.3 25.2 0.2 2.7 28.0 26.7
2 Maui County, HI 0.740757 0.8 0.6 28.8 10.6 23.3 10.7 31.5
3 Alameda County, CA 0.740399 12.4 1.2 28.2 1.0 5.2 22.7 33.2
4 Aleutians East Borough, AK 0.738867 7.7 21.8 41.4 0.7 3.7 13.5 12.9

In [11]:
div = div.rename(columns={'Location':'county_state','Diversity-Index':'div_index','Black or African American alone, percent, 2013':'af_am','American Indian and Alaska Native alone, percent, 2013':'native_2013','Asian alone, percent, 2013':'asian_am','Native Hawaiian and Other Pacific Islander alone, percent,':'pac_am','Two or More Races, percent, 2013':'two_or_more_races','Hispanic or Latino, percent, 2013':'hisp_lat_am','White alone, not Hispanic or Latino, percent, 2013':'white_am'})

In [12]:
div.head()


Out[12]:
county_state div_index af_am native_2013 asian_am pac_am two_or_more_races hisp_lat_am white_am
0 Aleutians West Census Area, AK 0.769346 7.4 13.8 31.1 2.3 4.8 14.6 29.2
1 Queens County, NY 0.742224 20.9 1.3 25.2 0.2 2.7 28.0 26.7
2 Maui County, HI 0.740757 0.8 0.6 28.8 10.6 23.3 10.7 31.5
3 Alameda County, CA 0.740399 12.4 1.2 28.2 1.0 5.2 22.7 33.2
4 Aleutians East Borough, AK 0.738867 7.7 21.8 41.4 0.7 3.7 13.5 12.9

In [13]:
len(div)


Out[13]:
3195

In [14]:
election.head()


Out[14]:
Unnamed: 0 votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name combined_fips
0 0 93003.0 130413.0 246588.0 0.377159 0.52887 37,410 15.17% AK Alaska 2013
1 1 93003.0 130413.0 246588.0 0.377159 0.52887 37,410 15.17% AK Alaska 2016
2 2 93003.0 130413.0 246588.0 0.377159 0.52887 37,410 15.17% AK Alaska 2020
3 3 93003.0 130413.0 246588.0 0.377159 0.52887 37,410 15.17% AK Alaska 2050
4 4 93003.0 130413.0 246588.0 0.377159 0.52887 37,410 15.17% AK Alaska 2060

In [15]:
election.county_name.count()


Out[15]:
3141

In [16]:
#Need to drop Alaska as it doesn't have any county names
election = election[election.county_name!='Alaska']
pop = pop[pop.county!='Alaska']

In [17]:
election.head()


Out[17]:
Unnamed: 0 votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name combined_fips
29 29 5908.0 18110.0 24661.0 0.239569 0.734358 12,202 49.48% AL Autauga County 1001
30 30 18409.0 72780.0 94090.0 0.195653 0.773515 54,371 57.79% AL Baldwin County 1003
31 31 4848.0 5431.0 10390.0 0.466603 0.522714 583 5.61% AL Barbour County 1005
32 32 1874.0 6733.0 8748.0 0.214220 0.769662 4,859 55.54% AL Bibb County 1007
33 33 2150.0 22808.0 25384.0 0.084699 0.898519 20,658 81.38% AL Blount County 1009

In [18]:
election = election.drop(election[[0,10]], axis=1)

In [19]:
election['county_state'] = election['county_name'] + ', ' + election['state_abbr']

In [20]:
prev_election['county_state'] = prev_election['county_name'] + ', ' + prev_election['state_abbr']

In [21]:
pop.head()


Out[21]:
state county est_pop_2015 pop_change_2015 int_mig_2015 dom_mig_2015 mig_2015
0 AL Alabama 4858979 12568 5726 -2268 3458
1 AL Autauga County 55347 57 19 -140 -121
2 AL Baldwin County 203709 3996 221 3469 3690
3 AL Barbour County 26489 -326 0 -281 -281
4 AL Bibb County 22583 34 21 4 25

In [22]:
pop['county_state'] = pop['county'] + ', ' + pop['state']

In [23]:
election.head()


Out[23]:
votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name county_state
29 5908.0 18110.0 24661.0 0.239569 0.734358 12,202 49.48% AL Autauga County Autauga County, AL
30 18409.0 72780.0 94090.0 0.195653 0.773515 54,371 57.79% AL Baldwin County Baldwin County, AL
31 4848.0 5431.0 10390.0 0.466603 0.522714 583 5.61% AL Barbour County Barbour County, AL
32 1874.0 6733.0 8748.0 0.214220 0.769662 4,859 55.54% AL Bibb County Bibb County, AL
33 2150.0 22808.0 25384.0 0.084699 0.898519 20,658 81.38% AL Blount County Blount County, AL

In [24]:
edu.head()


Out[24]:
FIPS Code State Area name less_hs_diploma_2000 hs_diploma_only_2000 less_4_years_2000 four_or_ higher_2000 per_less_high_school diploma_2000 per_hs_diploma_only_2000 per_less_4_years_2000 per_four_or_ higher_2000 less_high_school_diploma_2011_15 hs_diploma_only_2011_15 less_4_years_2011_15 four_or_ higher_2011_15 per_less_high_school_diploma_2011_15 per_hs_diploma_only_2011_15 per_less_4_years_2011_15 per_four_or_higher_2011_15
0 0 US United States 35715625.0 52168981.0 49864428.0 44462605.0 19.6 28.6 27.4 24.4 28229094.0 58722528.0 61558628.0 62952272.0 13.3 27.8 29.1 29.8
1 1000 AL Alabama 714081.0 877216.0 746495.0 549608.0 24.7 30.4 25.9 19.0 509891.0 1005295.0 962515.0 761650.0 15.7 31.0 29.7 23.5
2 1001 AL Autauga County 5872.0 9332.0 7413.0 4972.0 21.3 33.8 26.9 18.0 4656.0 12182.0 11044.0 8437.0 12.8 33.5 30.4 23.2
3 1003 AL Baldwin County 17258.0 28428.0 28178.0 22146.0 18.0 29.6 29.3 23.1 14360.0 39431.0 43500.0 39710.0 10.5 28.8 31.8 29.0
4 1005 AL Barbour County 6679.0 6124.0 4025.0 2068.0 35.3 32.4 21.3 10.9 5021.0 6490.0 4943.0 2354.0 26.7 34.5 26.3 12.5

In [25]:
pop.head()


Out[25]:
state county est_pop_2015 pop_change_2015 int_mig_2015 dom_mig_2015 mig_2015 county_state
0 AL Alabama 4858979 12568 5726 -2268 3458 Alabama, AL
1 AL Autauga County 55347 57 19 -140 -121 Autauga County, AL
2 AL Baldwin County 203709 3996 221 3469 3690 Baldwin County, AL
3 AL Barbour County 26489 -326 0 -281 -281 Barbour County, AL
4 AL Bibb County 22583 34 21 4 25 Bibb County, AL

In [26]:
edu['county_state'] = edu['Area name'] + ', ' + edu['State']

In [27]:
edu.dtypes


Out[27]:
FIPS Code                                 int64
State                                    object
Area name                                object
less_hs_diploma_2000                    float64
hs_diploma_only_2000                    float64
less_4_years_2000                       float64
four_or_ higher_2000                    float64
per_less_high_school diploma_2000       float64
per_hs_diploma_only_2000                float64
per_less_4_years_2000                   float64
per_four_or_ higher_2000                float64
less_high_school_diploma_2011_15        float64
hs_diploma_only_2011_15                 float64
less_4_years_2011_15                    float64
four_or_ higher_2011_15                 float64
per_less_high_school_diploma_2011_15    float64
per_hs_diploma_only_2011_15             float64
per_less_4_years_2011_15                float64
per_four_or_higher_2011_15              float64
county_state                             object
dtype: object

In [28]:
edu.isnull().sum()


Out[28]:
FIPS Code                                0
State                                    0
Area name                                0
less_hs_diploma_2000                    11
hs_diploma_only_2000                    11
less_4_years_2000                       11
four_or_ higher_2000                    11
per_less_high_school diploma_2000       11
per_hs_diploma_only_2000                11
per_less_4_years_2000                   11
per_four_or_ higher_2000                11
less_high_school_diploma_2011_15        10
hs_diploma_only_2011_15                 10
less_4_years_2011_15                    10
four_or_ higher_2011_15                 10
per_less_high_school_diploma_2011_15    10
per_hs_diploma_only_2011_15             10
per_less_4_years_2011_15                10
per_four_or_higher_2011_15              10
county_state                             0
dtype: int64

In [29]:
edu = edu.dropna()

In [30]:
edu.isnull().sum()


Out[30]:
FIPS Code                               0
State                                   0
Area name                               0
less_hs_diploma_2000                    0
hs_diploma_only_2000                    0
less_4_years_2000                       0
four_or_ higher_2000                    0
per_less_high_school diploma_2000       0
per_hs_diploma_only_2000                0
per_less_4_years_2000                   0
per_four_or_ higher_2000                0
less_high_school_diploma_2011_15        0
hs_diploma_only_2011_15                 0
less_4_years_2011_15                    0
four_or_ higher_2011_15                 0
per_less_high_school_diploma_2011_15    0
per_hs_diploma_only_2011_15             0
per_less_4_years_2011_15                0
per_four_or_higher_2011_15              0
county_state                            0
dtype: int64

In [31]:
len(edu)


Out[31]:
3267

In [32]:
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.distplot(edu.per_less_high_school_diploma_2011_15, kde=False)
ax.set(xlabel='Percentage per county with less than a High School Diploma, 2011-2015', ylabel='Count')
ax.set_title('Education Across All US Counties', fontsize=16, fontname='Ubuntu')
plt.show()


/Applications/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1288: UserWarning: findfont: Font family [u'Ubuntu'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

In [33]:
ax = sns.distplot(edu.per_hs_diploma_only_2011_15, kde=False)
ax.set(xlabel='Percentage per county with only High School Diploma, 2011-2015', ylabel='Count')
ax.set_title('Education Across All US Counties', fontsize=16, fontname='Ubuntu')
plt.show()



In [34]:
ax = sns.distplot(edu.per_less_4_years_2011_15, kde=False)
ax.set(xlabel='Percentage per county with less than four years of college, 2011-2015', ylabel='Count')
ax.set_title('Education Across All US Counties', fontsize=16, fontname='Ubuntu')
plt.show()



In [35]:
ax = sns.distplot(edu.per_four_or_higher_2011_15, kde=False)
ax.set(xlabel='Percentage per county with four or more years of college, 2011-2015', ylabel='Count')
ax.set_title('Education Across All US Counties', fontsize=16, fontname='Ubuntu')
plt.show()



In [36]:
election['per_dem'] = election['per_dem'].apply(lambda x: x*100)
election['per_gop'] = election['per_gop'].apply(lambda x: x*100)

In [37]:
prev_election['per_dem_2012'] = prev_election['per_dem_2012'].apply(lambda x: x*100)
prev_election['per_gop_2012'] = prev_election['per_gop_2012'].apply(lambda x: x*100)

In [38]:
election['per_point_diff'] = election['per_point_diff'].apply(lambda x: float(x.strip('%')))

In [39]:
# Making a new column for positive and negative--if per_dem is below 50%, negative. If
# above 50%, positive.

In [40]:
election.head()


Out[40]:
votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name county_state
29 5908.0 18110.0 24661.0 23.956855 73.435789 12,202 49.48 AL Autauga County Autauga County, AL
30 18409.0 72780.0 94090.0 19.565310 77.351472 54,371 57.79 AL Baldwin County Baldwin County, AL
31 4848.0 5431.0 10390.0 46.660250 52.271415 583 5.61 AL Barbour County Barbour County, AL
32 1874.0 6733.0 8748.0 21.422039 76.966164 4,859 55.54 AL Bibb County Bibb County, AL
33 2150.0 22808.0 25384.0 8.469902 89.851875 20,658 81.38 AL Blount County Blount County, AL

In [41]:
election['election_range'] = election['per_dem'] - election['per_gop']

In [42]:
prev_election['election_range'] = prev_election['per_dem_2012'] - prev_election['per_gop_2012']

In [43]:
prev_election.dtypes


Out[43]:
state_abbr              object
county_name             object
total_votes_2012         int64
votes_dem_2012           int64
votes_gop_2012           int64
county_fips              int64
state_fips               int64
per_dem_2012           float64
per_gop_2012           float64
diff_2012                int64
per_point_diff_2012    float64
county_state            object
election_range         float64
dtype: object

In [44]:
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.distplot(election.per_point_diff, kde=False)
ax.set(xlabel = "Percent Difference", ylabel='Count')
ax.set_title('2016 Election Margins in All US Counties', fontsize=16)
plt.show()



In [45]:
prev_election.head()


Out[45]:
state_abbr county_name total_votes_2012 votes_dem_2012 votes_gop_2012 county_fips state_fips per_dem_2012 per_gop_2012 diff_2012 per_point_diff_2012 county_state election_range
0 AL Autauga County 23909 6354 17366 1 1 26.575766 72.633736 11012 -0.460580 Autauga County, AL -46.057970
1 AL Baldwin County 84988 18329 65772 3 1 21.566574 77.389749 47443 -0.558232 Baldwin County, AL -55.823175
2 AL Barbour County 11459 5873 5539 5 1 51.252291 48.337551 334 0.029147 Barbour County, AL 2.914739
3 AL Bibb County 8391 2200 6131 7 1 26.218567 73.066381 3931 -0.468478 Bibb County, AL -46.847813
4 AL Blount County 23980 2961 20741 9 1 12.347790 86.492911 17780 -0.741451 Blount County, AL -74.145121

In [46]:
ax = sns.distplot(prev_election.per_point_diff_2012, kde=False)
ax.set(xlabel = "Percent Difference 2012", ylabel='Count')
ax.set_title('2016 Election Margins in All US Counties', fontsize=16)
plt.show()



In [47]:
import matplotlib.pyplot

In [48]:
ax = sns.distplot(election.election_range, kde=False)
ax.set(xlabel = "(negative=Republican, positive=Democrat, %)", ylabel='Count')
ax.set_title('Partisan Pattern per All US Counties, 2016', fontsize=16, fontname='Ubuntu')
plt.show()
# Democrats are in HUGE trouble. Of course, this distribution doesn't mean that they're 
# necessarily losing counties, but of those they held onto in 2016, they have a far, far
# weaker grasp on them than Republicans do on their side. Also, many of the Republican 
# counties are in Red States with few electoral votes. However, for Congressional voting
# this is still a dangerous sign.



In [49]:
# What was it like in 2012? 
ax = sns.distplot(prev_election.election_range, kde=False)
ax.set(xlabel = "(negative=Republican, positive=Democrat, %)", ylabel='Count')
ax.set_title('Partisan Degrees per County, 2012', fontsize=15, fontname='Ubuntu')
plt.show()
# It was already bad. But it's clearly gotten worse for Democrats.



In [50]:
election.describe()


Out[50]:
votes_dem votes_gop total_votes per_dem per_gop per_point_diff election_range
count 3.112000e+03 3112.000000 3.112000e+03 3112.000000 3112.000000 3112.000000 3112.000000
mean 2.006065e+04 19622.378856 4.174537e+04 31.708228 63.613409 39.233014 -31.905181
std 7.199807e+04 40442.737492 1.134048e+05 15.358601 15.651728 20.793041 30.883786
min 4.000000e+00 57.000000 6.400000e+01 3.144654 4.122067 0.040000 -91.636364
25% 1.166000e+03 3206.000000 4.820500e+03 20.475924 54.947846 22.467500 -54.689887
50% 3.153000e+03 7164.500000 1.094700e+04 28.473862 66.743096 40.315000 -38.217390
75% 9.608500e+03 17448.250000 2.879650e+04 39.999326 75.147062 55.462500 -14.876874
max 1.893770e+06 620285.000000 2.652072e+06 92.846592 95.272727 91.640000 88.724525

In [51]:
election['slight_dem'] = election['election_range'].apply(lambda x: 0< x <= 10)
election['slight_gop'] = election['election_range'].apply(lambda x: -10 <= x < 0)
election['med_dem'] = election['election_range'].apply(lambda x: 10< x <= 25)
election['med_gop'] = election['election_range'].apply(lambda x: -25 <= x < -10)
election['strong_dem'] = election['election_range'].apply(lambda x: 25 < x <= 50)
election['strong_gop'] = election['election_range'].apply(lambda x: -50 <= x < -25)

In [52]:
election.head()


Out[52]:
votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name county_state election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop
29 5908.0 18110.0 24661.0 23.956855 73.435789 12,202 49.48 AL Autauga County Autauga County, AL -49.478934 False False False False False True
30 18409.0 72780.0 94090.0 19.565310 77.351472 54,371 57.79 AL Baldwin County Baldwin County, AL -57.786162 False False False False False False
31 4848.0 5431.0 10390.0 46.660250 52.271415 583 5.61 AL Barbour County Barbour County, AL -5.611165 False True False False False False
32 1874.0 6733.0 8748.0 21.422039 76.966164 4,859 55.54 AL Bibb County Bibb County, AL -55.544124 False False False False False False
33 2150.0 22808.0 25384.0 8.469902 89.851875 20,658 81.38 AL Blount County Blount County, AL -81.381973 False False False False False False

In [53]:
#Combine the states and counties into a single column.
# Have to find a way to join the dfs by matching up those with the same 
# county names AND the same state (there are countys with the same name)
# Simply concatenating them won't work.

In [54]:
#http://www.cnbc.com/heres-a-map-of-the-us-counties-that-flipped-to-trump-from-democrats/

In [57]:
ue_rates.labor_force = ue_rates.labor_force.astype(float)
ue_rates.employed =  ue_rates.employed.astype(float)
ue_rates.unemployed =  ue_rates.unemployed.astype(float)
ue_rates.ue_rate =  ue_rates.ue_rate.astype(float)

In [58]:
ue_rates.dtypes


Out[58]:
county_state     object
labor_force     float64
employed        float64
unemployed      float64
ue_rate         float64
dtype: object

In [111]:
right = election.set_index('county_state')
left = ue_rates.set_index('county_state')
combined = left.join(right, lsuffix='', rsuffix='_r')
combined = combined_1.reset_index()

In [112]:
right = combined.set_index('county_state')
left = div.set_index('county_state')
combined_2 = left.join(right, lsuffix='', rsuffix = '_r')
combined_2 = combined_2.reset_index()

In [120]:
edu.head()


Out[120]:
FIPS Code State Area name less_hs_diploma_2000 hs_diploma_only_2000 less_4_years_2000 four_or_ higher_2000 per_less_high_school diploma_2000 per_hs_diploma_only_2000 per_less_4_years_2000 per_four_or_ higher_2000 less_high_school_diploma_2011_15 hs_diploma_only_2011_15 less_4_years_2011_15 four_or_ higher_2011_15 per_less_high_school_diploma_2011_15 per_hs_diploma_only_2011_15 per_less_4_years_2011_15 per_four_or_higher_2011_15 county_state
0 0 US United States 35715625.0 52168981.0 49864428.0 44462605.0 19.6 28.6 27.4 24.4 28229094.0 58722528.0 61558628.0 62952272.0 13.3 27.8 29.1 29.8 United States, US
1 1000 AL Alabama 714081.0 877216.0 746495.0 549608.0 24.7 30.4 25.9 19.0 509891.0 1005295.0 962515.0 761650.0 15.7 31.0 29.7 23.5 Alabama, AL
2 1001 AL Autauga County 5872.0 9332.0 7413.0 4972.0 21.3 33.8 26.9 18.0 4656.0 12182.0 11044.0 8437.0 12.8 33.5 30.4 23.2 Autauga County, AL
3 1003 AL Baldwin County 17258.0 28428.0 28178.0 22146.0 18.0 29.6 29.3 23.1 14360.0 39431.0 43500.0 39710.0 10.5 28.8 31.8 29.0 Baldwin County, AL
4 1005 AL Barbour County 6679.0 6124.0 4025.0 2068.0 35.3 32.4 21.3 10.9 5021.0 6490.0 4943.0 2354.0 26.7 34.5 26.3 12.5 Barbour County, AL

In [118]:
combined_2.columns


Out[118]:
Index([u'county_state', u'div_index', u'af_am', u'native_2013', u'asian_am',
       u'pac_am', u'two_or_more_races', u'hisp_lat_am', u'white_am', u'index',
       u'labor_force', u'employed', u'unemployed', u'ue_rate', u'votes_dem',
       u'votes_gop', u'total_votes', u'per_dem', u'per_gop', u'diff',
       u'per_point_diff', u'state_abbr', u'county_name', u'election_range',
       u'slight_dem', u'slight_gop', u'med_dem', u'med_gop', u'strong_dem',
       u'strong_gop'],
      dtype='object')

In [113]:
right = combined_2.set_index('county_state')
left = edu.set_index('county_state')
combined_3 = left.join(right, lsuffix='', rsuffix = '_r')
combined_3 = combined_2.reset_index()

In [117]:
combined_3.columns


Out[117]:
Index([u'level_0', u'county_state', u'div_index', u'af_am', u'native_2013',
       u'asian_am', u'pac_am', u'two_or_more_races', u'hisp_lat_am',
       u'white_am', u'index', u'labor_force', u'employed', u'unemployed',
       u'ue_rate', u'votes_dem', u'votes_gop', u'total_votes', u'per_dem',
       u'per_gop', u'diff', u'per_point_diff', u'state_abbr', u'county_name',
       u'election_range', u'slight_dem', u'slight_gop', u'med_dem', u'med_gop',
       u'strong_dem', u'strong_gop'],
      dtype='object')

In [114]:
right = combined_3.set_index('county_state')
left = pop.set_index('county_state')
combined_4 = left.join(right, lsuffix='', rsuffix = '_r')
combined_4 = combined_4.reset_index()

In [115]:
combined_4.isnull().sum()


Out[115]:
county_state          0
state                 0
county                0
est_pop_2015          0
pop_change_2015       0
int_mig_2015          0
dom_mig_2015          0
mig_2015              0
level_0               7
div_index             7
af_am                 7
native_2013           7
asian_am              7
pac_am                7
two_or_more_races     7
hisp_lat_am           7
white_am              7
index                21
labor_force          21
employed             21
unemployed           21
ue_rate              21
votes_dem            43
votes_gop            43
total_votes          43
per_dem              43
per_gop              43
diff                 43
per_point_diff       43
state_abbr           43
county_name          43
election_range       43
slight_dem           43
slight_gop           43
med_dem              43
med_gop              43
strong_dem           43
strong_gop           43
dtype: int64

In [71]:
combined_4.dropna(inplace=True)

In [76]:
combined_4= combined_4[combined_4.county_name!='Alaska']
#Just making sure Alaska isn't included

In [77]:
combined_4.head()


Out[77]:
county_state state county est_pop_2015 pop_change_2015 int_mig_2015 dom_mig_2015 mig_2015 level_0 div_index ... per_point_diff state_abbr county_name election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop
0 Abbeville County, SC SC Abbeville County 24932 6 22 -12 10 4.0 0.445417 ... 28.25 SC Abbeville County -28.254383 False False False False False True
1 Acadia Parish, LA LA Acadia Parish 62577 79 32 -281 -249 5.0 0.355956 ... 56.67 LA Acadia Parish -56.674943 False False False False False False
2 Accomack County, VA VA Accomack County 32973 -25 81 -53 28 6.0 0.539878 ... 11.71 VA Accomack County -11.710568 False False False True False False
3 Ada County, ID ID Ada County 434211 7364 933 3838 4771 7.0 0.256622 ... 9.24 ID Ada County -9.239878 False True False False False False
4 Adair County, IA IA Adair County 7228 -189 0 -161 -161 8.0 0.054921 ... 35.36 IA Adair County -35.355148 False False False False False True

5 rows × 38 columns


In [78]:
election.describe()


Out[78]:
votes_dem votes_gop total_votes per_dem per_gop per_point_diff election_range
count 3.112000e+03 3112.000000 3.112000e+03 3112.000000 3112.000000 3112.000000 3112.000000
mean 2.006065e+04 19622.378856 4.174537e+04 31.708228 63.613409 39.233014 -31.905181
std 7.199807e+04 40442.737492 1.134048e+05 15.358601 15.651728 20.793041 30.883786
min 4.000000e+00 57.000000 6.400000e+01 3.144654 4.122067 0.040000 -91.636364
25% 1.166000e+03 3206.000000 4.820500e+03 20.475924 54.947846 22.467500 -54.689887
50% 3.153000e+03 7164.500000 1.094700e+04 28.473862 66.743096 40.315000 -38.217390
75% 9.608500e+03 17448.250000 2.879650e+04 39.999326 75.147062 55.462500 -14.876874
max 1.893770e+06 620285.000000 2.652072e+06 92.846592 95.272727 91.640000 88.724525

In [80]:
# Set up range variables
ax = sns.distplot(combined_4.election_range, kde=False)
ax.set(xlabel = "(negative=Republican, positive=Democrat, %)", ylabel='Count')
ax.set_title('Partisan Pattern per All US Counties, 2016', fontsize=16, fontname='Ubuntu')
plt.show()


Visualizations were a little dense as far as the content they were showing, make sure you slow down better explain or use visualizations whose labels make them a lot easier for the to understand.

  • Be sure to frame your results in a positive light! Saying “it’s not a great score” and “score of only 66,” you convey to your audience a sense of disappointment and they likely come away with a negative outlook. If you frame it in a positive way, then you have more control about what your audience takes away!
  • You should seek to include additional advanced classification metrics: ROC AUC and confusion matrices are critical to the assessment of a model, and omitting them due to technical errors is insufficient for a capstone presentation

In [ ]:
# All counties, not including those in Alaska.

In [83]:
VA = combined_4[combined_4.state_abbr=='VA']
VA.head()


Out[83]:
county_state state county est_pop_2015 pop_change_2015 int_mig_2015 dom_mig_2015 mig_2015 level_0 div_index ... per_point_diff state_abbr county_name election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop
2 Accomack County, VA VA Accomack County 32973 -25 81 -53 28 6.0 0.539878 ... 11.71 VA Accomack County -11.710568 False False False True False False
30 Albemarle County, VA VA Albemarle County 105703 1352 410 675 1085 33.0 0.379061 ... 25.06 VA Albemarle County 25.056116 False False False False True False
37 Alexandria city, VA VA Alexandria city 153511 2071 2334 -2139 195 40.0 0.641095 ... 59.03 VA Alexandria city 59.026135 False False False False False False
45 Alleghany County, VA VA Alleghany County 15677 -207 2 -85 -83 48.0 0.154440 ... 37.07 VA Alleghany County -37.065426 False False False False False True
56 Amelia County, VA VA Amelia County 12903 118 8 123 131 59.0 0.420531 ... 36.30 VA Amelia County -36.304193 False False False False False True

5 rows × 38 columns

Virginia EDA


In [84]:
ax = sns.distplot(VA.election_range, kde=False)
ax.set(xlabel = "Negative=Republican, Positive=Democrat (%)", ylabel='Count')
ax.set_title('Partisan Degree in Virginia Counties, 2016', fontsize=16, fontname='Ubuntu')
plt.show()



In [85]:
import matplotlib.pyplot as plt
import seaborn as sns

In [89]:
ax = sns.regplot(VA.div_index, VA.per_dem)
ax.set(xlabel = 'Diversity Index', ylabel = 'County Vote Percent Democrat(%)')
ax.set_title("Diversity's Contribution to Democratic Votes in Virginia Counties", fontsize=20)
plt.show()



In [90]:
ax = sns.regplot(VA.div_index, VA.per_gop)
ax.set(xlabel = 'Diversity Index', ylabel = 'County Vote Percent Republican(%)')
ax.set_title("Diversity's Contribution to Republican Votes in Virginia Counties", fontsize=20)
plt.show()



In [100]:
ax = sns.regplot(VA.est_pop_2015, VA.per_dem)
ax.set(xlabel = 'Population Per County', ylabel = 'County Vote Percent Democrats(%)')
ax.set_title("Population Level's Contribution to Democratic Votes in Virginia Counties", fontsize=20)
plt.show()
# Not much of a contribution at all in VAb



In [99]:
ax = sns.regplot(VA.pop_change_2015, VA.per_dem)
ax.set(xlabel = 'Population Change Per County', ylabel = 'County Vote Percent Democrats(%)')
ax.set_title("Population Change's Contribution to Democratic Votes in Virginia Counties", fontsize=20)
plt.show()
# Again, not much



In [105]:
ax = sns.regplot(VA.white_am, VA.per_dem)
ax.set(xlabel = 'Whites Per County(%)', ylabel = 'County Vote For Democrats(%)')
ax.set_title("White Americans' Contribution to Democratic Votes in Virginia Counties", fontsize=20)
plt.show()



In [104]:
ax = sns.regplot(VA.white_am, VA.per_gop)
ax.set(xlabel = 'Whites Per County(%)', ylabel = 'County Vote For Republicans(%)')
ax.set_title("White Americans' Contribution to Republican Votes in Virginia Counties", fontsize=20)
plt.show()



In [107]:
ax = sns.regplot(VA.af_am, VA.per_dem)
ax.set(xlabel = 'African Americans Per County(%)', ylabel = 'County Vote For Democrats(%)')
ax.set_title("African Americans' Contribution to Democratic Votes in Virginia Counties", fontsize=20)
plt.show()



In [108]:
ax = sns.regplot(VA.af_am, VA.per_gop)
ax.set(xlabel = 'African Americans Per County(%)', ylabel = 'County Vote For Republicans(%)')
ax.set_title("African Americans' Contribution to Republican Votes in Virginia Counties", fontsize=20)
plt.show()



In [110]:
ax = sns.regplot(VA.asian_am, VA.per_dem)
ax.set(xlabel = 'Asian Americans Per County(%)', ylabel = 'County Vote For Democrats(%)')
ax.set_title("Asian Americans' Contribution to Democratic Votes in Virginia Counties", fontsize=20)
plt.show()
#There's a slight correlation, but nothing too substantive. Need to id these specific counties.



In [ ]:
ax = sns.regplot(VA., VA.)
ax.set(xlabel = 'Asian Americans Per County(%)', ylabel = 'County Vote For Democrats(%)')
ax.set_title("Asian Americans' Contribution to Democratic Votes in Virginia Counties", fontsize=20)
plt.show()

In [85]:
# Making swing state list based on the crucial swing states this election.

IA = combined_5[combined_5['state_abbr']==('IA')]
WI = combined_5[combined_5['state_abbr']==('WI')]
MI = combined_5[combined_5['state_abbr']==('MI')]
PA = combined_5[combined_5['state_abbr']==('PA')]
FL = combined_5[combined_5['state_abbr']==('FL')]
NC = combined_5[combined_5['state_abbr']==('NC')]
OH = combined_5[combined_5['state_abbr']==('OH')]
MN = combined_5[combined_5['state_abbr']==('MN')]
swing_states= pd.concat([IA, WI, MI, PA, FL, NC, OH, MN])
# 'IA', 'WI','MI','PA','FL','NC','OH','MN'

In [86]:
swing_states.head()


Out[86]:
county_state state county est_pop_2015 pop_change_2015 int_mig_2015 dom_mig_2015 mig_2015 FIPS Code State ... per_point_diff state_abbr county_name_r election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop
4 Adair County, IA IA Adair County 7228 -189 0 -161 -161 19001.0 IA ... 35.36 IA Adair County -35.355148 False False False False False True
9 Adams County, IA IA Adams County 3796 -75 0 -80 -80 19003.0 IA ... 39.77 IA Adams County -39.769452 False False False False False True
40 Allamakee County, IA IA Allamakee County 13886 -175 21 -216 -195 19005.0 IA ... 24.32 IA Allamakee County -24.323534 False False False True False False
75 Appanoose County, IA IA Appanoose County 12529 -99 -2 -61 -63 19007.0 IA ... 36.38 IA Appanoose County -36.384514 False False False False False True
106 Audubon County, IA IA Audubon County 5773 -20 0 -19 -19 19009.0 IA ... 31.25 IA Audubon County -31.251850 False False False False False True

5 rows × 61 columns


In [87]:
ax = sns.distplot(swing_states.election_range, kde=False)
ax.set(xlabel = "Negative=Republican, Positive=Democrat (%)", ylabel='Count')
ax.set_title('Partisan Degree in All Swing State Counties, 2016', fontsize=16, fontname='Ubuntu')
plt.show()
# As expected, in swing states it's not AS bad for Democrats compared to the rest of the 
# country but still quite dire.


Influence of Ethnicity


In [96]:
ax = sns.regplot(combined_5.white_am, combined_5.per_dem)
ax.set(xlabel = 'Percentage White American(%)', ylabel = 'County Vote Percent Democrat(%)')
plt.show()



In [97]:
ax = sns.regplot(combined_5.white_am, combined_5.per_gop)
ax.set(xlabel = 'Percentage White American(%)', ylabel = 'County Vote Percent Republican(%)')
plt.show()
# It's scattered, but there is stil a strong correlation between percentage white 
# population and Republican vote.



In [98]:
ax = sns.regplot(combined_5.af_am, combined_5.per_dem)
ax.set(xlabel = 'Percentage African American(%)', ylabel = 'County Vote Percent Democrat(%)')
ax.set_title('African American Influence on 2016 Democrtic Vote in All US Counties', fontsize=15)
plt.show()



In [99]:
ax = sns.regplot(combined_5.af_am, combined_5.per_gop)
ax.set(xlabel = 'Percentage African American(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title('African American Influence on 2016 Republican Vote in All US Counties', fontsize=15)
plt.show()



In [100]:
ax = sns.regplot(combined_5.hisp_lat_am, combined_5.per_dem)
ax.set(xlabel = 'Percentage Hispanic/Latino(%)', ylabel = 'County Vote Percent Democrat(%)')
ax.set_title('Hispanic/Latino Influence on 2016 Democratic Vote in All US Counties', fontsize=15)
plt.show()



In [101]:
ax = sns.regplot(combined_5.hisp_lat_am, combined_5.per_gop)
ax.set(xlabel = 'Percentage Hispanic/Latino(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title('Hispanic/Latino Influence on 2016 Republican Vote in All US Counties', fontsize=15)
plt.show()
# A correlation is there, but it's not that strong due to the sheer amount of 
# counties with little hispanic/latino population.



In [102]:
ax = sns.regplot(combined_5.asian_am, combined_5.per_dem)
ax.set(xlabel = 'Percentage Asian American(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title('Asian American Influence on 2016 Democratic Vote in All US Counties', fontsize=15)
plt.show()



In [103]:
ax = sns.regplot(combined_5.asian_am, combined_5.per_gop)
ax.set(xlabel = 'Percentage Asian American(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title('Asian American Influence on 2016 Republican Vote in All US Counties', fontsize=15)
plt.show()



In [ ]:

Swing States


In [104]:
ax = sns.regplot(swing_states.div_index, swing_states.election_range)
ax.set(xlabel = 'Diversity Index', ylabel = 'Election Range, Neg=Republican, Pos=Democrat(%)')
ax.set_title("Diversity's Effect on Swing State Votes", fontsize=20, fontname='Ubuntu')
plt.show()



In [105]:
ax = sns.regplot(swing_states.div_index, swing_states.per_dem)
ax.set(xlabel = 'Diversity Index', ylabel = 'County Vote Percent Democrat(%)')
ax.set_title("Diversity's Effect on Democratic Vote in Swing States", fontsize=20, fontname='Ubuntu')
plt.show()



In [106]:
ax = sns.regplot(swing_states.div_index, swing_states.per_gop)
ax.set(xlabel = 'Diversity Index', ylabel = 'County Vote Percent Republican(%)')
ax.set_title("Diversity's Effect on Republican Vote in Swing States", fontsize=20, fontname='Ubuntu')
plt.show()



In [107]:
ax = sns.regplot(swing_states.ue_rate, swing_states.election_range)
ax.set(xlabel = 'Unemployment Rate(%)', ylabel = 'Election Range(%)')
plt.show()



In [108]:
# No discernable realtionship for unemployment in the swing states, just as in the overall dataset.

In [109]:
ax = sns.regplot(swing_states.white_am, swing_states.per_dem)
ax.set(xlabel = 'Percentage White American(%)', ylabel = 'County Vote Percent Democrat(%)')
ax.set_title("White Americans' Contribtuion to 2016 Swing State Democratic Vote", fontsize=16)
plt.show()



In [110]:
ax = sns.regplot(swing_states.white_am, swing_states.per_gop)
ax.set(xlabel = 'Percentage White American(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title("White Americans' Contribtuion to 2016 Swing State Republican Vote", fontsize=16)
plt.show()



In [111]:
# Look for how incomes of white americans influence how they vote.

In [112]:
ax = sns.regplot(swing_states.af_am, swing_states.per_dem)
ax.set(xlabel = 'Percentage African American(%)', ylabel = 'County Vote Percent Democrat(%)')
ax.set_title('African American Influence on 2016 Democratic Vote in Swing State Counties', fontsize=15)
plt.show()



In [113]:
ax = sns.regplot(swing_states.af_am, swing_states.per_gop)
ax.set(xlabel = 'Percentage African American(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title('African American Influence on 2016 Republican Vote in Swing State Counties', fontsize=15)
plt.show()



In [114]:
ax = sns.regplot(swing_states.hisp_lat_am, swing_states.per_dem)
ax.set(xlabel = 'Percentage Hispanic/Latino(%)', ylabel = 'County Vote Percent Democrat(%)')
plt.show()



In [115]:
# Again, a scattered, but string correlation.

In [116]:
# The change in the uninsured rate does not appear to have benefitted Democrats, 
# but does appear to have benefitted Republicans.

Influence of Education


In [117]:
edu.columns


Out[117]:
Index([                           u'FIPS Code',
                                      u'State',
                                  u'Area name',
                       u'less_hs_diploma_2000',
                       u'hs_diploma_only_2000',
                          u'less_4_years_2000',
                       u'four_or_ higher_2000',
          u'per_less_high_school diploma_2000',
                   u'per_hs_diploma_only_2000',
                      u'per_less_4_years_2000',
                   u'per_four_or_ higher_2000',
           u'less_high_school_diploma_2011_15',
                    u'hs_diploma_only_2011_15',
                       u'less_4_years_2011_15',
                    u'four_or_ higher_2011_15',
       u'per_less_high_school_diploma_2011_15',
                u'per_hs_diploma_only_2011_15',
                   u'per_less_4_years_2011_15',
                 u'per_four_or_higher_2011_15',
                               u'county_state'],
      dtype='object')

In [118]:
ax = sns.regplot(combined_5.per_hs_diploma_only_2011_15, combined_5.per_gop)
ax.set(xlabel = 'High School Diploma Only(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title("Lower Education's Contribution to 2016 Republican Vote in All US Counties", fontsize=16)
plt.show()



In [119]:
ax = sns.regplot(combined_5.per_four_or_higher_2011_15, combined_5.per_gop)
ax.set(xlabel = 'Four or more University Years(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title("Higher Education's Contribution to 2016 Republican Vote in All US Counties", fontsize=16)
plt.show()



In [120]:
ax = sns.regplot(combined_5.per_hs_diploma_only_2011_15, combined_5.per_dem)
ax.set(xlabel = 'High School Diploma Only(%)', ylabel = 'County Vote Percent Democrat(%)')
ax.set_title("Lower Education's Contribution to 2016 Democratic Vote", fontsize=16)
plt.show()



In [121]:
ax = sns.regplot(combined_5.per_four_or_higher_2011_15, combined_5.per_dem)
ax.set(xlabel = 'Four or more University Years(%)', ylabel = 'County Vote Percent Republican(%)')
ax.set_title("Higher Education's Contribution to 2016 Democratic Vote in All US Counties", fontsize=16)
plt.show()



In [122]:
ax = sns.regplot(combined_5.per_hs_diploma_only_2011_15, combined_5.election_range)
ax.set(xlabel = 'High School Diploma Only per County(%)', ylabel = 'Election Range (neg=Rep, pos=Dem, %)')
ax.set_title("Lower Education's Contribution to 2016 Vote", fontsize=16)
plt.show()



In [123]:
ax = sns.regplot(combined_5.per_four_or_higher_2011_15, combined_5.election_range)
ax.set(xlabel = 'Four or more University Years per County(%)', ylabel = 'Election Range (neg=Rep, pos=Dem, %)')
ax.set_title("Higher Education's Contribution to 2016 Vote in All US Counties", fontsize=16)
plt.show()


Swing States


In [124]:
ax = sns.regplot(swing_states.per_hs_diploma_only_2011_15, swing_states.election_range)
ax.set(xlabel = 'High School Diploma Only per County(%)', ylabel = 'Election Range (neg=Rep, pos=Dem, %)')
ax.set_title("Lower Education's Contribution to 2016 Vote in Swing State Counties", fontsize=16)
plt.show()



In [125]:
ax = sns.regplot(swing_states.per_four_or_higher_2011_15, swing_states.election_range)
ax.set(xlabel = 'Four or more University Years per County(%)', ylabel = 'Election Range (neg=Rep, pos=Dem, %)')
ax.set_title("Higher Education's Contribution to 2016 Vote in Swing State Counties", fontsize=16)
plt.show()



In [126]:
# If a county has a higher percentage of people with only a hs diploma, then more likely
# to vote Republican. If a county has a higher proportion of 4+ college degrees, then 
# more likely to go Democrat. Pretty much aligns with Nat Silver's argument.

In [127]:
combined_5.labor_force.head()


Out[127]:
0     10423.0
1     26186.0
2     15972.0
3    217281.0
4      4266.0
Name: labor_force, dtype: float64

Labor Force


In [128]:
ax = sns.regplot(combined_5.labor_force, combined_5.election_range)
ax.set(xlabel = 'Labor Force Body per County', ylabel = 'Election Range(neg=Rep, pos=Dem, %)')
ax.set_title("Labor Force Contribution to Votes in All counties", fontsize=16)
plt.show()


Population


In [129]:
combined_5.head(1)


Out[129]:
county_state state county est_pop_2015 pop_change_2015 int_mig_2015 dom_mig_2015 mig_2015 FIPS Code State ... per_point_diff state_abbr county_name_r election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop
0 Abbeville County, SC SC Abbeville County 24932 6 22 -12 10 45001.0 SC ... 28.25 SC Abbeville County -28.254383 False False False False False True

1 rows × 61 columns


In [130]:
ax = sns.regplot(combined_5.est_pop_2015, combined_5.election_range)
ax.set(xlabel = 'Popularion per County (2015)', ylabel = '2016 Election Range(neg=Rep, pos=Dem, %)')
ax.set_title("Population Contribution to Votes in All Counties", fontsize=16)
plt.show()



In [131]:
# Population size per county does correlate with vote.

In [132]:
ax = sns.regplot(combined_5.pop_change_2015, combined_5.election_range)                                                                                              
ax.set(xlabel = 'Population Change per County(2015)', ylabel = '2016 Election Range(neg=Rep, pos=Dem, %)')
ax.set_title("Population Change Contribution to Votes in All Counties", fontsize=16)
plt.show()



In [133]:
# Counties that experienced a positve change in population saw a boost for Dems.

In [134]:
# Although there is that cluster towards zero, and the correlation is broad, there
# is still something there.

Modeling

Regression

Most predictive features for counties' vote found through EDA:

(note that these variables, sometimes by their nature, don't necessarily follow a normal distribution)

Percentage White American population

Percentage African American population

Percentage Asian American population

Percentage High School Diploma only

Percentage Four or more years of University


In [135]:
combined_5.columns


Out[135]:
Index([                        u'county_state',
                                      u'state',
                                     u'county',
                               u'est_pop_2015',
                            u'pop_change_2015',
                               u'int_mig_2015',
                               u'dom_mig_2015',
                                   u'mig_2015',
                                  u'FIPS Code',
                                      u'State',
                                  u'Area name',
                       u'less_hs_diploma_2000',
                       u'hs_diploma_only_2000',
                          u'less_4_years_2000',
                       u'four_or_ higher_2000',
          u'per_less_high_school diploma_2000',
                   u'per_hs_diploma_only_2000',
                      u'per_less_4_years_2000',
                   u'per_four_or_ higher_2000',
           u'less_high_school_diploma_2011_15',
                    u'hs_diploma_only_2011_15',
                       u'less_4_years_2011_15',
                    u'four_or_ higher_2011_15',
       u'per_less_high_school_diploma_2011_15',
                u'per_hs_diploma_only_2011_15',
                   u'per_less_4_years_2011_15',
                 u'per_four_or_higher_2011_15',
                                  u'div_index',
                                      u'af_am',
                                u'native_2013',
                                   u'asian_am',
                                     u'pac_am',
                          u'two_or_more_races',
                                u'hisp_lat_am',
                                   u'white_am',
                                u'county_fips',
                                u'county_name',
                               u'state_abbrev',
                               u'2013_ui_rate',
                               u'2016_ui_rate',
                                   u'ui_delta',
                                u'labor_force',
                                   u'employed',
                                 u'unemployed',
                                    u'ue_rate',
                                  u'votes_dem',
                                  u'votes_gop',
                                u'total_votes',
                                    u'per_dem',
                                    u'per_gop',
                                       u'diff',
                             u'per_point_diff',
                                 u'state_abbr',
                              u'county_name_r',
                             u'election_range',
                                 u'slight_dem',
                                 u'slight_gop',
                                    u'med_dem',
                                    u'med_gop',
                                 u'strong_dem',
                                 u'strong_gop'],
      dtype='object')

In [136]:
modeling = combined_5.drop(combined_5[[0,1,2,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,34,35,36,37,38,39,40,52,53]], axis=1)

In [137]:
modeling.head()


Out[137]:
est_pop_2015 pop_change_2015 per_hs_diploma_only_2011_15 per_four_or_higher_2011_15 div_index af_am native_2013 asian_am pac_am two_or_more_races ... per_gop diff per_point_diff election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop
0 24932 6 37.5 12.3 0.445417 28.2 0.3 0.4 0.0 1.3 ... 62.868333 3,030 28.25 -28.254383 False False False False False True
1 62577 79 39.2 10.5 0.355956 18.3 0.3 0.4 0.0 1.3 ... 77.262105 15,521 56.67 -56.674943 False False False False False False
2 32973 -25 39.9 18.8 0.539878 28.0 0.6 0.6 0.2 1.5 ... 54.471596 1,845 11.71 -11.710568 False False False True False False
3 434211 7364 21.4 37.1 0.256622 1.3 0.8 2.6 0.2 2.6 ... 47.931611 18,072 9.24 -9.239878 False True False False False False
4 7228 -189 44.7 15.3 0.054921 0.2 0.1 0.4 0.0 0.7 ... 65.336526 1,329 35.36 -35.355148 False False False False False True

5 rows × 29 columns


In [138]:
modeling.isnull().sum()


Out[138]:
est_pop_2015                   0
pop_change_2015                0
per_hs_diploma_only_2011_15    0
per_four_or_higher_2011_15     0
div_index                      0
af_am                          0
native_2013                    0
asian_am                       0
pac_am                         0
two_or_more_races              0
hisp_lat_am                    0
labor_force                    0
employed                       0
unemployed                     0
ue_rate                        0
votes_dem                      0
votes_gop                      0
total_votes                    0
per_dem                        0
per_gop                        0
diff                           0
per_point_diff                 0
election_range                 0
slight_dem                     0
slight_gop                     0
med_dem                        0
med_gop                        0
strong_dem                     0
strong_gop                     0
dtype: int64

In [139]:
modeling.dropna(inplace=True)
#Only 46 isn't too significant.

In [140]:
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import confusion_matrix, mean_squared_error

In [141]:
lr = LinearRegression()

In [142]:
modeling.columns


Out[142]:
Index([               u'est_pop_2015',             u'pop_change_2015',
       u'per_hs_diploma_only_2011_15',  u'per_four_or_higher_2011_15',
                         u'div_index',                       u'af_am',
                       u'native_2013',                    u'asian_am',
                            u'pac_am',           u'two_or_more_races',
                       u'hisp_lat_am',                 u'labor_force',
                          u'employed',                  u'unemployed',
                           u'ue_rate',                   u'votes_dem',
                         u'votes_gop',                 u'total_votes',
                           u'per_dem',                     u'per_gop',
                              u'diff',              u'per_point_diff',
                    u'election_range',                  u'slight_dem',
                        u'slight_gop',                     u'med_dem',
                           u'med_gop',                  u'strong_dem',
                        u'strong_gop'],
      dtype='object')

In [143]:
X = modeling[[0,1,2,3,4,5,6,7,8,9,10,11]] 
y = modeling['election_range']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

In [144]:
X.head(0)


Out[144]:
est_pop_2015 pop_change_2015 per_hs_diploma_only_2011_15 per_four_or_higher_2011_15 div_index af_am native_2013 asian_am pac_am two_or_more_races hisp_lat_am labor_force

In [145]:
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)

In [146]:
ax = sns.regplot(y_test, y_pred)
ax.set(xlabel = 'Predicted Election Range (neg=Rep, pos=Dem)', ylabel = 'Actual Election Range(neg=Rep, pos=Dem)')
ax.set_title("Predicted vs. Actual Election Ranges for All Counties", fontsize=16)
plt.show()



In [147]:
lr.score(X_train, y_train)


Out[147]:
0.66806575723520978

Model Swing States


In [246]:
s_modeling = swing_states.drop(swing_states[[0,1,2,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,34,35,36,37,38,39,40,52,53]], axis=1)

In [247]:
swing_states.head(0)


Out[247]:
county_state state county est_pop_2015 pop_change_2015 int_mig_2015 dom_mig_2015 mig_2015 FIPS Code State ... per_point_diff state_abbr county_name_r election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop

0 rows × 61 columns


In [248]:
s_modeling.head(0)


Out[248]:
est_pop_2015 pop_change_2015 per_hs_diploma_only_2011_15 per_four_or_higher_2011_15 div_index af_am native_2013 asian_am pac_am two_or_more_races ... per_gop diff per_point_diff election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop

0 rows × 29 columns


In [249]:
X = s_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]] 
y = s_modeling['election_range']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

In [250]:
X.head()


Out[250]:
est_pop_2015 pop_change_2015 per_hs_diploma_only_2011_15 per_four_or_higher_2011_15 div_index af_am native_2013 asian_am pac_am two_or_more_races hisp_lat_am labor_force
4 7228 -189 44.7 15.3 0.054921 0.2 0.1 0.4 0.0 0.7 1.5 4266.0
9 3796 -75 39.1 15.1 0.058873 0.3 0.5 0.6 0.0 0.6 1.1 2300.0
40 13886 -175 42.1 16.3 0.159016 1.5 0.6 0.5 0.3 1.0 5.8 7727.0
75 12529 -99 36.3 17.6 0.074125 0.6 0.3 0.3 0.0 1.1 1.6 6255.0
106 5773 -20 42.3 14.3 0.049200 0.4 0.2 0.5 0.0 0.7 0.9 3251.0

In [251]:
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)

In [253]:
ax = sns.regplot(y_test, y_pred)
ax.set(xlabel = 'Predicted Election Range (neg=Rep, pos=Dem)', ylabel = 'Actual Election Range(neg=Rep, pos=Dem)')
ax.set_title("Predicted vs. Actual Election Ranges for Swing State Counties", fontsize=16)
plt.show()



In [254]:
lr.score(X_train, y_train)
# Right around the same R^2 score as all counties.


Out[254]:
0.66209473643341399

Classification

Now we want to see what features classify a county into being "slight dem", "slight gop, "med_dem", "med_gop", "strong_dem", and "strong_gop."


In [206]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, precision_score, recall_score, roc_curve, auc

In [159]:
# Setting the number of neighbors to the square root of number of instances is a good 
# rule of thumb.
knn = KNeighborsClassifier(n_neighbors = 55)
rfc = RandomForestClassifier(max_depth = 5)

In [160]:
modeling.head()


Out[160]:
est_pop_2015 pop_change_2015 per_hs_diploma_only_2011_15 per_four_or_higher_2011_15 div_index af_am native_2013 asian_am pac_am two_or_more_races ... per_gop diff per_point_diff election_range slight_dem slight_gop med_dem med_gop strong_dem strong_gop
0 24932 6 37.5 12.3 0.445417 28.2 0.3 0.4 0.0 1.3 ... 62.868333 3,030 28.25 -28.254383 False False False False False True
1 62577 79 39.2 10.5 0.355956 18.3 0.3 0.4 0.0 1.3 ... 77.262105 15,521 56.67 -56.674943 False False False False False False
2 32973 -25 39.9 18.8 0.539878 28.0 0.6 0.6 0.2 1.5 ... 54.471596 1,845 11.71 -11.710568 False False False True False False
3 434211 7364 21.4 37.1 0.256622 1.3 0.8 2.6 0.2 2.6 ... 47.931611 18,072 9.24 -9.239878 False True False False False False
4 7228 -189 44.7 15.3 0.054921 0.2 0.1 0.4 0.0 0.7 ... 65.336526 1,329 35.36 -35.355148 False False False False False True

5 rows × 29 columns


In [161]:


In [162]:
c_modeling = modeling.join(dummies)
c_modeling = c_modeling.reset_index()
c_modeling = c_modeling.drop(c_modeling[[0]], axis=1)

In [163]:
c_modeling.head()


Out[163]:
est_pop_2015 pop_change_2015 per_hs_diploma_only_2011_15 per_four_or_higher_2011_15 div_index af_am native_2013 asian_am pac_am two_or_more_races ... slight_gop_False slight_gop_True med_dem_False med_dem_True med_gop_False med_gop_True strong_dem_False strong_dem_True strong_gop_False strong_gop_True
0 24932 6 37.5 12.3 0.445417 28.2 0.3 0.4 0.0 1.3 ... 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0
1 62577 79 39.2 10.5 0.355956 18.3 0.3 0.4 0.0 1.3 ... 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
2 32973 -25 39.9 18.8 0.539878 28.0 0.6 0.6 0.2 1.5 ... 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0
3 434211 7364 21.4 37.1 0.256622 1.3 0.8 2.6 0.2 2.6 ... 0.0 1.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0
4 7228 -189 44.7 15.3 0.054921 0.2 0.1 0.4 0.0 0.7 ... 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0

5 rows × 41 columns


In [164]:
c_modeling.columns


Out[164]:
Index([               u'est_pop_2015',             u'pop_change_2015',
       u'per_hs_diploma_only_2011_15',  u'per_four_or_higher_2011_15',
                         u'div_index',                       u'af_am',
                       u'native_2013',                    u'asian_am',
                            u'pac_am',           u'two_or_more_races',
                       u'hisp_lat_am',                 u'labor_force',
                          u'employed',                  u'unemployed',
                           u'ue_rate',                   u'votes_dem',
                         u'votes_gop',                 u'total_votes',
                           u'per_dem',                     u'per_gop',
                              u'diff',              u'per_point_diff',
                    u'election_range',                  u'slight_dem',
                        u'slight_gop',                     u'med_dem',
                           u'med_gop',                  u'strong_dem',
                        u'strong_gop',            u'slight_dem_False',
                   u'slight_dem_True',            u'slight_gop_False',
                   u'slight_gop_True',               u'med_dem_False',
                      u'med_dem_True',               u'med_gop_False',
                      u'med_gop_True',            u'strong_dem_False',
                   u'strong_dem_True',            u'strong_gop_False',
                   u'strong_gop_True'],
      dtype='object')

In [255]:
# Swing State Classifiers
dummies = pd.get_dummies(s_modeling[['slight_dem','slight_gop','med_dem','med_gop','strong_dem','strong_gop']])
cs_modeling = s_modeling.join(dummies)
cs_modeling = cs_modeling.reset_index()
cs_modeling = cs_modeling.drop(c_modeling[[0]], axis=1)

First test for slight dem and slight gop.


In [265]:
# First try KNN for just slight dem and slight gop.
X = c_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = c_modeling[[29,30,31,32]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [266]:
X.head()


Out[266]:
est_pop_2015 pop_change_2015 per_hs_diploma_only_2011_15 per_four_or_higher_2011_15 div_index af_am native_2013 asian_am pac_am two_or_more_races hisp_lat_am labor_force
0 24932 6 37.5 12.3 0.445417 28.2 0.3 0.4 0.0 1.3 1.2 10423.0
1 62577 79 39.2 10.5 0.355956 18.3 0.3 0.4 0.0 1.3 2.0 26186.0
2 32973 -25 39.9 18.8 0.539878 28.0 0.6 0.6 0.2 1.5 9.0 15972.0
3 434211 7364 21.4 37.1 0.256622 1.3 0.8 2.6 0.2 2.6 7.5 217281.0
4 7228 -189 44.7 15.3 0.054921 0.2 0.1 0.4 0.0 0.7 1.5 4266.0

In [267]:
y.head()


Out[267]:
slight_dem_False slight_dem_True slight_gop_False slight_gop_True
0 1.0 0.0 1.0 0.0
1 1.0 0.0 1.0 0.0
2 1.0 0.0 1.0 0.0
3 1.0 0.0 0.0 1.0
4 1.0 0.0 1.0 0.0

In [268]:
knn.fit(X_train, y_train)


Out[268]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=55, p=2,
           weights='uniform')

In [269]:
y_pred = knn.predict(X_test)

In [270]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.892871526379
0.901771336554
[ 0.90342052  0.90140845  0.8832998   0.875       0.90120968]
             precision    recall  f1-score   support

          0       0.95      1.00      0.98       591
          1       0.00      0.00      0.00        30
          2       0.95      1.00      0.97       590
          3       0.00      0.00      0.00        31

avg / total       0.90      0.95      0.93      1242


In [ ]:

Now test for medium gop and medium dem.


In [215]:
#KNN for med_dem and med_gop
X = c_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = c_modeling[[33,34,35,36]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
y_pred = knn.predict(X_test)

In [216]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.826016915022
0.811594202899
[ 0.82293763  0.81891348  0.81488934  0.83870968  0.83467742]
             precision    recall  f1-score   support

          0       0.95      1.00      0.97       589
          1       0.00      0.00      0.00        32
          2       0.86      1.00      0.93       536
          3       0.00      0.00      0.00        85

avg / total       0.82      0.91      0.86      1242

Now test for strong gop and strong dem.


In [ ]:
#KNN for strong dem and stronggop
X = c_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = c_modeling[[37,38,39,40]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
y_pred = knn.predict(X_test)

In [218]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.627064035441
0.631239935588
[ 0.56740443  0.64788732  0.64788732  0.64717742  0.60685484]
             precision    recall  f1-score   support

          0       0.95      1.00      0.98       593
          1       0.00      0.00      0.00        28
          2       0.68      1.00      0.81       420
          3       0.00      0.00      0.00       201

avg / total       0.68      0.82      0.74      1242

Swing States Classifiers


In [256]:
#First slight dem and slight gop
X = cs_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = cs_modeling[[29,30,31,32]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
knn.fit(X_train, y_train)


Out[256]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=55, p=2,
           weights='uniform')

In [257]:
y_pred = knn.predict(X_test)

In [258]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.856332703214
0.87969924812
[ 0.85849057  0.87735849  0.83962264  0.86792453  0.83809524]
             precision    recall  f1-score   support

          0       0.93      1.00      0.96       124
          1       0.00      0.00      0.00         9
          2       0.95      1.00      0.97       126
          3       0.00      0.00      0.00         7

avg / total       0.88      0.94      0.91       266

Medium Dem and GOP


In [259]:
#KNN for med_dem and med_gop
X = cs_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = cs_modeling[[33,34,35,36]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
y_pred = knn.predict(X_test)

In [260]:
knn.fit(X_train,y_train)


Out[260]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=55, p=2,
           weights='uniform')

In [261]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.744801512287
0.781954887218
[ 0.80188679  0.69811321  0.74528302  0.73584906  0.74285714]
             precision    recall  f1-score   support

          0       0.98      1.00      0.99       130
          1       0.00      0.00      0.00         3
          2       0.80      1.00      0.89       107
          3       0.00      0.00      0.00        26

avg / total       0.80      0.89      0.84       266

Strong Dem and GOP


In [262]:
X = cs_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = cs_modeling[[37,38,39,40]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
y_pred = knn.predict(X_test)
knn.fit(X_train,y_train)


Out[262]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=55, p=2,
           weights='uniform')

In [263]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.576559546314
0.428571428571
[ 0.47169811  0.59433962  0.5754717   0.56603774  0.59047619]
             precision    recall  f1-score   support

          0       0.93      1.00      0.96       124
          1       0.00      0.00      0.00         9
          2       0.50      1.00      0.66        66
          3       0.00      0.00      0.00        67

avg / total       0.56      0.71      0.61       266

Modeling for the "strong" counties of 25-50% is not that predictive.


In [178]:
## Random Forests

RFC for slight dem and slight gop


In [224]:
X = c_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = c_modeling[[29,30,31,32]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [225]:
rfc.fit(X_train, y_train)


Out[225]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [226]:
y_pred = rfc.predict(X_test)

In [227]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.892871526379
0.901771336554
[ 0.90342052  0.90140845  0.8832998   0.875       0.90120968]
             precision    recall  f1-score   support

          0       0.95      1.00      0.98       591
          1       0.00      0.00      0.00        30
          2       0.95      1.00      0.97       590
          3       0.00      0.00      0.00        31

avg / total       0.90      0.95      0.93      1242

RFC for medium dem and medium gop


In [228]:
X = c_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = c_modeling[[33,34,35,36]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [229]:
rfc.fit(X_train, y_train)


Out[229]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [230]:
y_pred = rfc.predict(X_test)

In [231]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.826016915022
0.80998389694
[ 0.82293763  0.81891348  0.81488934  0.83870968  0.83467742]
             precision    recall  f1-score   support

          0       0.95      1.00      0.97       589
          1       0.00      0.00      0.00        32
          2       0.86      1.00      0.93       536
          3       1.00      0.01      0.02        85

avg / total       0.89      0.90      0.86      1242

RFC for strong dem and strong gop


In [232]:
X = c_modeling[[0,1,2,3,4,5,6,7,8,9,10,11]]
y = c_modeling[[37,38,39,40]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [233]:
rfc.fit(X_train, y_train)


Out[233]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [234]:
y_pred = rfc.predict(X_test)

In [235]:
print knn.score(X_train,y_train)
print accuracy_score(y_test, y_pred)
print cross_val_score(knn, X_train, y_train, cv=5)
print(classification_report(y_test,y_pred))


0.627064035441
0.642512077295
[ 0.56740443  0.64788732  0.64788732  0.64717742  0.60685484]
             precision    recall  f1-score   support

          0       0.95      1.00      0.98       593
          1       0.00      0.00      0.00        28
          2       0.69      0.98      0.81       420
          3       0.64      0.08      0.14       201

avg / total       0.79      0.82      0.76      1242


In [191]:
# Just like in KNN, not the best classifier for "strong counties."

In [192]:
## Problem statement: What are the economic and demographic factors we can use to predict
## whether a county votes Democrat or Republican? More specifically, how do these factors 
## affect the margin of a Democrat or Republican winning the vote in a swing state county?
## Furthermore, are the parties becoming racial identity parties--how much does the data 
## convey this?

In [193]:
## Potential questions down the line:
## Look closely at the election week's coverage and how to build off that
## How many misleading data driven stories have their been? Atlantic--said most predicitive 
## question was whether Obama was born here (bunch of false positives)==precision vs recall 
## problem. Look at HOW METRICS HAVE BEEN ABUSED. 

## DEBUNK these stories.